-
In the first decades of the 21st century, there has been a global trend towards digitisation and the mobilisation of data from natural history museums and research institutions. The development of national and international aggregator systems, which focused on data standards, made it possible to access millions of museum specimen records. These records serve as an empirical foundation for research across various fields. In addition, community efforts have expanded the concept of natural history collection specimens to include physical preparations and digital resources, resulting in the Digital Extended Specimen (DES), which also includes derived and related data. Within this context, the paper proposes using the FAIR Digital Object (FDO) framework to accelerate the global vision of the DES, arguing that FDO-enabled infrastructures can reduce barriers to the discovery and access of specimens, help ensure credit back to contributors, and increase the amount of research that incorporates biodiversity data.
-
Commonly used data citation practices rely on unverifiable retrieval methods that are susceptible to "content drift", which occurs when the data associated with an identifier have been allowed to change. Building on our earlier work on reliable dataset identifiers, we propose signed citations, i.e., customary data citations extended to also include a standards-based, verifiable, unique, and fixed-length digital content signature. We show that content signatures enable independent verification of the cited content and can improve the persistence of the citation. Because content signatures are location- and storage-medium-agnostic, cited data can be copied to new locations to ensure their persistence across current and future storage media and data networks. As a result, content signatures can be leveraged to help scalably store, locate, access, and independently verify content across new and existing data infrastructures. Content signatures can also be embedded inside content to create robust, distributed knowledge graphs that can be cited using a single signed citation. We describe real-world applications of signed citations used to cite and compile distributed data collections, cite specific versions of existing data networks, and stabilize associations between URLs and content.
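To make the idea concrete, here is a minimal sketch (not the authors' implementation; the `hash://sha256/` URI form and the helper names are illustrative): a signed citation appends a fixed-length cryptographic digest of the cited bytes, so anyone holding a copy can verify it without contacting the original host.

```python
import hashlib

def content_signature(data: bytes) -> str:
    """Fixed-length, location-agnostic signature of the cited bytes."""
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()

def verify(data: bytes, signature: str) -> bool:
    """Recompute the signature from any local copy; no network access needed."""
    return content_signature(data) == signature

# Sign a dataset snapshot, then verify a copy retrieved from another location.
snapshot = b"species,lat,lon\nPuma concolor,29.65,-82.32\n"
sig = content_signature(snapshot)
signed_citation = f"Example Occurrence Dataset (2021). {sig}"
assert verify(snapshot, sig)             # copy is byte-identical to what was cited
assert not verify(snapshot + b"x", sig)  # any content drift is detected
```

Because the signature depends only on the bytes, the cited content can move across storage media and networks and still be located and independently verified.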
-
Thanks to substantial support for biodiversity data mobilization in recent decades, billions of occurrence records are openly available, documenting life on Earth and enabling timely research, awareness raising, and policy-making. Initiatives across local to global scales have been separately funded to serve different, yet often overlapping audiences of data users, and have developed a variety of platforms and infrastructures to meet the needs of these audiences. The independent progress of biodiversity data providers has led to innovations as well as challenges for the community at large as we move towards connecting and linking a diversity of information from disparate sources as Digital Extended Specimens (DES). Recognizing a need for deeper and more frequent opportunities for communication and collaboration across the globe, an ad-hoc group of representatives of various international, national, and regional organizations have been meeting virtually since 2020 to provide a forum for updates, announcements, and shared progress. This group is provisionally named International Partners for the Digital Extended Specimen (IPDES), and is guided by these four concepts: Biodiversity, Connection, Knowledge and Agency. Participants in IPDES include representatives of the Global Biodiversity Information Facility (GBIF), Integrated Digitized Biocollections (iDigBio), American Institute of Biological Sciences (AIBS), Biodiversity Collections Network (BCoN), Natural Science Collections Alliance (NSCA), Distributed System of Scientific Collections (DiSSCo), Atlas of Living Australia (ALA), Biodiversity Information Standards (TDWG), Society for the Preservation of Natural History Collections (SPNHC), National Specimen Information Infrastructure of China (NSII), and South African National Biodiversity Institute (SANBI), as well as individuals involved with biodiversity informatics initiatives, natural science collections, museums, herbaria, and universities. Our global partners group strives to increase representation from around the globe as we aim to enable research that contributes to novel discoveries and addresses the societal challenges leading to the biodiversity crisis. Our overarching mission is to expand on the community-driven successes to connect biodiversity data and knowledge through coordination of a globally integrated network of stakeholders to enable an extensible technical and social infrastructure of data, tools, and working practices in support of our vision. The main work of our group thus far includes publishing a paper on the Digital Extended Specimen (Hardisty et al. 2022), organizing and hosting an array of activities at conferences, and asynchronous online work and forum-based exchanges. We aim to advance discussion on topics of broad interest to our community such as social and technical capacity building, broadening participation, expanding social and data networks, improving data models and building a backbone for the DES, and identifying international funding solutions. This presentation will highlight some of these activities and detail progress towards a roadmap for the development of the human network and technical infrastructure necessary to support the DES. It provides an opportunity for feedback from and engagement by stakeholder communities such as TDWG and other initiatives with a focus on data standards and biodiversity informatics, as we solidify our plans for the future in support of integrated and interconnected biodiversity data and credit for those doing the work.
-
Information Extraction (IE) from imaged text is affected by the output quality of the text-recognition process. Misspelled or missing text may propagate errors or even preclude IE. Low confidence in automated methods is the reason why some IE projects rely exclusively on human work (crowdsourcing). That is the case of biological collections (biocollections), where the metadata (Darwin Core terms) found in digitized labels are transcribed by citizen scientists. In this paper, we present an approach to reduce the number of crowdsourcing tasks required to obtain the transcription of the text found in biocollections' images. By using an ensemble of Optical Character Recognition (OCR) engines - OCRopus, Tesseract, and the Google Cloud OCR - our approach identifies the lines and characters that have a high probability of being correct, so that crowdsourced transcription is needed only for low-confidence fragments of text. The number of lines to transcribe is further reduced through hybrid human-machine crowdsourcing, where the output of the OCR ensemble is used as the first "human" transcription of the redundant crowdsourcing process. Our approach was tested on six biocollections (2,966 images), reducing the number of crowdsourcing tasks by 76% (58% due to lines accepted by the OCR ensemble and about 18% due to accelerated convergence when using hybrid crowdsourcing). The automatically extracted text presented a character error rate of 0.001 (0.1%).
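A minimal sketch of the acceptance rule (the agreement threshold and whitespace normalization are illustrative assumptions, not the paper's exact criterion): a line is accepted automatically when independent OCR engines agree on its text, and routed to the crowd otherwise.

```python
from collections import Counter
from typing import Optional

def ensemble_accept(engine_outputs: list[str], min_agreement: int = 2) -> Optional[str]:
    """Accept an OCR'd line when at least `min_agreement` engines produce
    the same whitespace-normalized text; return None to route it to the crowd."""
    normalized = [" ".join(s.split()) for s in engine_outputs]
    text, votes = Counter(normalized).most_common(1)[0]
    return text if votes >= min_agreement else None

# One label line as read by three engines (OCRopus, Tesseract, Google Cloud OCR):
readings = ["Florida Museum of Natural History",
            "Florida  Museum of Natural History",
            "F1orida Museun of Natural Histosy"]
accepted = ensemble_accept(readings)  # -> "Florida Museum of Natural History"
needs_crowd = accepted is None        # only low-confidence lines become tasks
```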
-
DOI: 10.1109/COMPSAC.2019.10284
Hadoop is a popular data-analytics platform based on the MapReduce model. When analyzing extremely big data, hard disk drives are commonly used, and Hadoop performance can be optimized by improving I/O performance. Hard disk drives have different performance depending on whether data are placed in the outer or inner disk zones. In this paper, we propose a method that uses knowledge of job characteristics to place data in hard disk drives so that Hadoop performance is improved. Files of a job that intensively and sequentially accesses the storage device are placed in outer disk tracks, which have higher sequential access speed than inner tracks. Temporary and permanent files are placed in the outer and inner zones, respectively, which keeps the faster outer zones available for reuse by preventing permanent files from occupying them. Our evaluation demonstrates that the proposed method improves the performance of Hadoop jobs by 15.0% over the normal case when file placement is not used. The proposed method also outperforms a previously proposed placement approach by 9.9%.
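A minimal sketch of the placement policy (the mount points and the temporary/permanent classification are assumptions for illustration; the paper applies this inside Hadoop's storage layer, not in userland Python): temporary, sequentially accessed job files go to a partition backed by the fast outer tracks, permanent files to the inner tracks.

```python
import shutil
from pathlib import Path

# Hypothetical mount points for partitions carved from the outer (fast,
# higher sequential throughput) and inner (slower) zones of the same HDD.
OUTER_ZONE = Path("/mnt/hdd_outer")
INNER_ZONE = Path("/mnt/hdd_inner")

def place_file(src: Path, is_temporary: bool) -> Path:
    """Place temporary, sequentially accessed files in the outer zone and
    permanent files in the inner zone, mirroring the proposed policy."""
    dest = (OUTER_ZONE if is_temporary else INNER_ZONE) / src.name
    shutil.move(str(src), dest)
    return dest

# A MapReduce spill file is temporary; final job output is permanent:
# place_file(Path("/tmp/spill_0.out"), is_temporary=True)     # outer tracks
# place_file(Path("/data/part-r-00000"), is_temporary=False)  # inner tracks
```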
-
Biological collections store information with broad societal and environmental impact. In the last 15 years, after worldwide investments and crowdsourcing efforts, 25% of the collected specimens have been digitized; a process that includes the imaging of text attached to specimens and subsequent extraction of information from the resulting image. This information extraction (IE) process is complex, thus slow and typically involving human tasks. We propose a hybrid (Human-Machine) information extraction model that efficiently uses resources of different cost (machines, volunteers and/or experts) and speeds up the biocollections' digitization process, while striving to maintain the same quality as human-only IE processes. In the proposed model, called SELFIE, self-aware IE processes determine whether their output quality is satisfactory. If the quality is unsatisfactory, additional or alternative processes that yield higher quality output at higher cost are triggered. The effectiveness of this model is demonstrated by three SELFIE workflows for the extraction of Darwin Core terms from specimens' images. Compared to the traditional human-driven IE approach, SELFIE workflows showed, on average, a reduction of 27% in the information-capture time and a decrease of 32% in the required number of humans and their associated cost, while the quality of the results was negligibly reduced by 0.27%.
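A minimal sketch of the self-aware escalation loop (the tier ordering, confidence scores, and threshold are illustrative assumptions, not SELFIE's actual parameters): each process assesses its own output quality and triggers the next, costlier process only when that quality is unsatisfactory.

```python
from typing import Callable

# A tier takes a label image and returns (extracted_value, self-assessed
# confidence in [0, 1]). Tiers are ordered by increasing cost.
Tier = Callable[[bytes], tuple[str, float]]

def selfie_extract(image: bytes, tiers: list[Tier], threshold: float = 0.9) -> str:
    """Run tiers in cost order (e.g., machine OCR, then volunteers, then
    experts); stop at the first whose confidence meets the threshold."""
    value = ""
    for run in tiers:
        value, confidence = run(image)
        if confidence >= threshold:
            return value  # quality satisfactory; skip costlier tiers
    return value  # otherwise keep the most expensive tier's output
```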
-
Citizen science projects have successfully taken advantage of volunteers to unlock scientific information contained in images. Crowds extract scientific data by completing different types of activities: transcribing text, selecting values from pre-defined options, reading data aloud, or pointing and clicking at graphical elements. When designing crowdsourcing tasks, selecting the best form of input and task granularity is essential for keeping volunteers engaged and maximizing the quality of the results. In the context of biocollections information extraction, this study compares three interface actions (transcribe, select, and crop) and tasks of different levels of granularity (single-field vs. compound tasks). Using 30 crowdsourcing experiments and two different populations, these interface alternatives are evaluated in terms of speed, quality, perceived difficulty, and enjoyability. The results show that Selection and Transcription tasks generate high-quality output, but they are perceived as boring. Conversely, Cropping tasks, and arguably graphical tasks in general, are more enjoyable, but their output quality depends on additional machine-oriented processing. When the text to be extracted is longer than two or three words, Transcription is slower than Selection and Cropping. When using compound tasks, the overall time required for the crowdsourcing experiment is considerably shorter than with single-field tasks, but compound tasks are perceived as more difficult. When using single-field tasks, both the quality of the output and the amount of identified data are slightly higher compared to compound tasks, but single-field tasks are perceived by the crowd as less entertaining.
-
Historical data sources, like medical records or biological collections, consist of unstructured, heterogeneous content: handwritten text, different sizes and types of fonts, and text overlapped with lines, images, stamps, and sketches. The information these documents provide is important, both from a historical perspective and because we can learn from it. The automatic digitization of these historical documents is a complex machine-learning process that usually produces poor results, requiring costly interventions by experts, who have to transcribe and interpret the content. This paper describes hybrid (Human- and Machine-Intelligent) workflows for scientific data extraction, combining machine-learning and crowdsourcing software elements. Our results demonstrate that the mix of human and machine processes has advantages in data extraction time and quality when compared to a machine-only workflow. More specifically, we show how OCRopus and Tesseract, two widely used open-source Optical Character Recognition (OCR) tools, can improve their accuracy by more than 42% when text areas are cropped by humans prior to OCR, while the total time can increase or decrease depending on the OCR selection. The digitization of 400 images, with Entomology, Bryophyte, and Lichen specimens, is evaluated following four different approaches: processing the whole specimen image (machine-only), processing crowd-cropped labels (hybrid), processing crowd-cropped fields (hybrid), and cleaning the machine-only output. As a secondary result, our experiments reveal differences in speed and quality between Tesseract and OCRopus.
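A minimal sketch of the crop-then-OCR stage (the file name and box coordinates are hypothetical; pytesseract is a common Python wrapper for the Tesseract engine named above): a human supplies the label's bounding box, and only that region is passed to OCR, which is where the reported accuracy gain comes from.

```python
from PIL import Image
import pytesseract  # Python wrapper around the Tesseract OCR engine

def ocr_cropped_label(image_path: str, box: tuple[int, int, int, int]) -> str:
    """OCR only the human-cropped label region (left, upper, right, lower)
    instead of the whole specimen image."""
    label = Image.open(image_path).crop(box)
    return pytesseract.image_to_string(label)

# A crowd worker marked the specimen label at these (hypothetical) pixels:
# text = ocr_cropped_label("specimen_042.jpg", (120, 840, 980, 1150))
```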